or healthy and diseased) but it can also be a complex study that includes more than a
single factor (factorial design). Once the study design has been recorded as metadata,
inferential statistics are used to identify which genes show a statistically significant change in
expression relative to the same gene in a reference sample. The fundamental step in
differential expression analysis is to model the association between gene counts (Y)
and the covariates (conditions) of interest (X). The number of replicates is crucial for the
statistical analysis, and in most RNA-Seq studies it is small. For this reason, most programs
for RNA-Seq data analysis use generalized linear models (GLMs) rather than non-parametric
methods, assuming that the count data follow a particular statistical distribution. This
approach also assumes that each RNA-Seq read is sampled independently from a population
of reads and that the read either aligns to gene g or does not. When a read aligns to
gene g, we call this a success; otherwise it is a failure. A sequence of independent random
trials with two possible outcomes (success or failure) is called a Bernoulli process. Thus,
according to probability theory, the number of reads (successes), $Y_{gj}$, for a given gene g
from sample j follows a binomial distribution:
$$Y_{gj} \sim \mathrm{Binomial}(n_j, \pi_{gj}) \tag{5.8}$$
Here $Y_{gj}$ is the number of reads from sample j that align to gene g, $n_j$ is the number of
independent trials in the Bernoulli process (the total number of reads sequenced from
sample j), $\pi_{gj}$ is the probability of success (a read aligns to gene g in sample j), and
$1 - \pi_{gj}$ is the probability of failure.
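The Bernoulli-trial view of read sampling can be illustrated with a small simulation. This is only a sketch: the function name `simulate_read_count` and the values chosen for the number of trials and the success probability are hypothetical, not taken from the text.

```python
import random


def simulate_read_count(n_j, pi_gj, seed=0):
    """Count successes in n_j independent Bernoulli trials.

    Each trial stands for one read; a success means the read
    aligns to gene g, which happens with probability pi_gj.
    """
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    return sum(1 for _ in range(n_j) if rng.random() < pi_gj)


n_j = 10_000   # hypothetical number of reads (trials) in sample j
pi_gj = 0.01   # hypothetical probability that a read aligns to gene g

y_gj = simulate_read_count(n_j, pi_gj)
print("simulated read count:", y_gj)
print("expected count n_j * pi_gj:", n_j * pi_gj)
```

The simulated count should fall close to the product of the number of trials and the success probability, which is the binomial mean.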
Assume also that gene g has length $l_g$ and read count $Y_{gj}$ in sample j. All possible
positions in g that can produce a read can be described as $Y_{gj} l_g$ [30].
Thus, the probability of success, $\pi_{gj}$, is given as
$$\pi_{gj} = \frac{Y_{gj}\, l_g}{\sum_{g=1}^{G} Y_{gj}\, l_g} \tag{5.9}$$
where G is the number of genes in the sample.
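As a minimal sketch of Equation 5.9, the success probabilities for one sample can be computed from per-gene read counts and gene lengths. The function name `success_probabilities` and the three-gene example values are hypothetical and chosen only for illustration.

```python
def success_probabilities(counts, lengths):
    """Eq. 5.9 for a single sample j.

    counts[g]  : read count Y_gj of gene g in sample j
    lengths[g] : length l_g of gene g
    Returns pi_gj = Y_gj * l_g / sum over all genes of Y_gj * l_g.
    """
    if len(counts) != len(lengths):
        raise ValueError("counts and lengths must have the same length")
    weighted = [y * l for y, l in zip(counts, lengths)]
    total = sum(weighted)
    if total == 0:
        raise ValueError("the sample has no reads")
    return [w / total for w in weighted]


# Hypothetical sample with G = 3 genes
counts = [10, 30, 60]
lengths = [1000, 2000, 500]

pi = success_probabilities(counts, lengths)
print(pi)  # the probabilities sum to 1 across the G genes
```

Because the denominator sums over all G genes, the resulting probabilities always add up to 1 within a sample.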
According to the binomial distribution, the mean of the read counts is given as
$$\mu_{gj} = n_j \times \pi_{gj} \tag{5.10}$$
The probability of observing a given number of reads ($Y_{gj} = y_{gj}$) for a given gene is
$$P(Y_{gj} = y_{gj}) = \binom{n_j}{y_{gj}}\, \pi_{gj}^{\,y_{gj}}\, (1 - \pi_{gj})^{\,n_j - y_{gj}} \tag{5.11}$$
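The binomial probability in Equation 5.11 and the mean in Equation 5.10 can be checked numerically. The sketch below uses only the Python standard library; the function name `binomial_pmf` and the small values chosen for the number of trials and the success probability are assumptions made for illustration.

```python
from math import comb


def binomial_pmf(y, n, pi):
    """Eq. 5.11: probability of exactly y successes in n trials."""
    return comb(n, y) * pi**y * (1 - pi) ** (n - y)


n_j = 20      # hypothetical number of trials (reads) in sample j
pi_gj = 0.1   # hypothetical success probability for gene g

# Probability of every possible read count 0..n_j
pmf = [binomial_pmf(y, n_j, pi_gj) for y in range(n_j + 1)]

# The mean computed from the distribution should match Eq. 5.10
mean_from_pmf = sum(y * p for y, p in enumerate(pmf))
print("sum of probabilities:", sum(pmf))
print("mean from pmf:", mean_from_pmf, "n_j * pi_gj:", n_j * pi_gj)
```

Summing the read count weighted by its probability over all possible counts recovers the product of the number of trials and the success probability, in agreement with Equation 5.10.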
However, because RNA-Seq count data contain a very large number of reads and the
probability that a read aligns to any given gene is very small, the Poisson distribution is a
more appropriate model than the binomial distribution, provided that the mean read count
of a gene equals its variance, as the Poisson distribution assumes.
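The closeness of the two distributions when the number of reads is large and the success probability is small can be seen numerically. The following sketch compares the binomial and Poisson probabilities for hypothetical values (one million reads and a success probability of 0.00001); the function names are illustrative only.

```python
from math import comb, exp, lgamma, log


def binomial_pmf(y, n, pi):
    """Binomial probability of y successes in n trials."""
    return comb(n, y) * pi**y * (1 - pi) ** (n - y)


def poisson_pmf(y, mu):
    """Poisson probability of y events with mean mu (computed on the log scale)."""
    return exp(-mu + y * log(mu) - lgamma(y + 1))


n_j = 1_000_000   # hypothetical total number of reads in sample j
pi_gj = 1e-5      # hypothetical probability of aligning to gene g

mean = n_j * pi_gj                    # binomial mean, also used as the Poisson mean
variance = n_j * pi_gj * (1 - pi_gj)  # binomial variance

# Largest pointwise gap between the two distributions over counts 0..40
max_diff = max(
    abs(binomial_pmf(y, n_j, pi_gj) - poisson_pmf(y, mean)) for y in range(41)
)
print("mean:", mean, "variance:", variance)
print("largest difference in probability:", max_diff)
```

With these values the binomial mean and variance are nearly identical, and the two probability functions differ only marginally, which is the situation in which the Poisson model is a reasonable substitute.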